11th iThome Ironman Contest

DAY 9

Google Developers Machine Learning

Can AI Analyze Stocks? Series, Part 9

TFRecord: Stepping in Every Pitfall There Is

預計撐兩天XD

2019-09-10 22:31:59

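Part 0. Introduction

The work of the past three days only produced results on day three.
This is really not how I pictured it going, aaaaargh!!!!
And that hard-won result is mostly thanks to day three's decisive move: giving up on TFRecord.

What the heck? That's lame! How could I just accept that?

As the saying goes, wherever you fall down, that's where you cry (hey!).
So today I'm going full masochist and jumping into the TFRecord pit barefoot.
Let's start stomping!

PS: If you'd rather not step in it, scroll down to the summary Orz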

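Part 1. What Is TFRecord?

TFRecord is a file storage format built specifically for TensorFlow, in the same family as .csv, .md, .pkl, .npy, .json, and so on. Each has its own way of storing things, but data files like these generally come in only a few flavors:

ASCII (text)
- The most common kind; .csv, .md, and .json all fall here. The data is stored in a form humans can read directly.
- An ASCII character takes 8 bits; a Chinese character typically takes 24 bits in UTF-8.
binary
- Usually chosen to save space: values are stored in their raw machine representation rather than as readable text, as in .pkl, .npy, and TFRecord.
- This is especially compact for numbers.
  - Take the float 1.08462e-4: stored as a 32-bit float it takes only 32 bits, but as a string it takes 10 * 8 = 80 bits.
- Binary files also need an encoder and a decoder before they turn into something we can understand.
Hex
- Common in embedded systems, in transfer files (Bluetooth, WiFi), and in low-level languages; typically used when talking to the lower layers of the machine.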
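
To make the size argument concrete, here is a minimal sketch using only Python's standard `struct` module. It assumes the float is stored as IEEE 754 single precision and the text as ASCII/UTF-8; the variable names are just for illustration.

```python
import struct

value = 1.08462e-4

# Text form: one byte per ASCII character, so 10 characters take 10 bytes.
as_text = "1.08462e-4".encode("ascii")

# Binary form: a single-precision float is always 4 bytes.
as_binary = struct.pack("<f", value)

text_bits = len(as_text) * 8
binary_bits = len(as_binary) * 8
print(text_bits, binary_bits)  # 80 32

# A Chinese character encoded as UTF-8 takes 3 bytes (24 bits).
chinese_bits = len("中".encode("utf-8")) * 8

# Reading the binary form back requires knowing the format code ("<f"),
# which is the "decoder" part of the story.
(decoded,) = struct.unpack("<f", as_binary)
```

Note that the decoded value is only equal to the original up to float32 precision.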

Since TFRecord is a binary storage format, we obviously have to compare it against the other binary formats commonly used in Python!

1. A Quick Comparison

Format	Write speed	Read speed	File size
`.npy`	`fastest`	`fastest`	`smallest`
`.pkl`	medium	`slowest`	`largest`
`.tfrecord`	`slowest`	medium	medium

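The .npy format can only be saved and loaded through Python's numpy, but it is rock solid. It is also my favorite format: simple, easy to use, and very stable, and looking at the comparison it is, sure enough, the best of the three. .pkl, by contrast, is Python's own special storage format, yet in this comparison it loses to .npy across the board. Where .pkl does beat .npy is in storing non-array data, where it is far less restrictive. So overall, if you are comfortable with NumPy arrays, .npy is the better choice, and .pkl is still there for special cases.
That's the quick comparison.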
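
For reference, a small sketch of what saving and loading looks like in each of the two formats. The file names and the `record` dictionary are made up for illustration, and no timing is measured here.

```python
import os
import pickle
import tempfile

import numpy as np

tmp_dir = tempfile.mkdtemp()
array = np.arange(12, dtype=np.float32).reshape(3, 4)

# .npy: NumPy's own format, one array per file.
npy_path = os.path.join(tmp_dir, "data.npy")
np.save(npy_path, array)
from_npy = np.load(npy_path)

# .pkl: handles arbitrary Python objects, not just arrays.
# The keys below are hypothetical example data.
record = {"ticker": "2330", "prices": array, "note": None}
pkl_path = os.path.join(tmp_dir, "data.pkl")
with open(pkl_path, "wb") as f:
    pickle.dump(record, f)
with open(pkl_path, "rb") as f:
    from_pkl = pickle.load(f)
```

The dictionary with mixed value types is the kind of non-array data where .pkl is the more natural fit.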

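W...wait, didn't we forget something? What about the mighty TFRecord?

Nothing much to say. It's just middle of the road.

N...no way. Isn't it the hardest of the three to use? And the slowest to write, too~

By that logic, isn't it even worse than .pkl?
Hold on. Let's first talk about how TFRecord is read.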

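2. How TFRecord Is Read

Compare this with .npy. Normally, as with .npy, the entire dataset is read into memory.

Imagine loading a 30 GB .npy file: you need at least 30 GB of memory.
TFRecord does not read data that way. It reads step by step, taking only as much as the current training step needs.

(Please forgive my art skills plus MS Paint; that's not the point.)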
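
The idea of reading only what the current step needs can be illustrated without TensorFlow at all. The sketch below is not how TFRecord is implemented; it is a toy file of fixed-size float32 records (a layout assumed purely for illustration) read through a generator, so only one batch is in memory at a time.

```python
import os
import struct
import tempfile

RECORD = struct.Struct("<f")  # hypothetical layout: one float32 per record


def write_records(path, values):
    """Write each value as a fixed-size binary record."""
    with open(path, "wb") as f:
        for v in values:
            f.write(RECORD.pack(v))


def read_in_batches(path, batch_size):
    """Yield lists of at most batch_size floats without loading the whole file."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(RECORD.size * batch_size)
            if not chunk:
                break
            yield [item[0] for item in RECORD.iter_unpack(chunk)]


path = os.path.join(tempfile.mkdtemp(), "toy.bin")
write_records(path, [float(i) for i in range(10)])

# Collected into a list here only to inspect the result; a training loop
# would consume one batch per step instead.
batches = list(read_in_batches(path, batch_size=4))
print([len(b) for b in batches])  # [4, 4, 2]
```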

The internal details are, of course, left to Google; we just use it.
So next, let's see how to use it!

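Part 2. Preparing to Use TFRecord

TFRecord stores array data as features, and a feature can only be one of the following three types:
- a. tf.train.BytesList: byte data, which can be converted from the following types
  - string
  - byte
- b. tf.train.FloatList: floating-point data, which can be converted from the following types
  - float
  - double
- c. tf.train.Int64List: 64-bit integers, which can be converted from the following types
  - bool
  - enum
  - int32
  - uint32
  - int64
  - uint64
- In addition, all of these messages can be serialized to a binary string with .SerializeToString.
tf.Example is normally used to build the features, which avoids a lot of verbose code.
- A usage example follows. It is easier to grasp than to explain (sorry, there's really a lot to cover today...).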

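import numpy as np
import tensorflow as tf

# Helper functions assumed by serialize_example. They are not shown in the
# original post; these follow the TensorFlow TFRecord tutorial.
def _bytes_feature(value):
    """Returns a bytes_list from a string / byte."""
    if isinstance(value, type(tf.constant(0))):
        value = value.numpy()  # BytesList won't unpack a string from an EagerTensor.
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(value):
    """Returns a float_list from a float / double."""
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def _int64_feature(value):
    """Returns an int64_list from a bool / enum / int / uint."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

# The number of observations in the dataset.
n_observations = int(1e4)

# Boolean feature, encoded as False or True.
feature0 = np.random.choice([False, True], n_observations)

# Integer feature, random from 0 to 4.
feature1 = np.random.randint(0, 5, n_observations)

# String feature
strings = np.array([b'cat', b'dog', b'chicken', b'horse', b'goat'])
feature2 = strings[feature1]

# Float feature, from a standard normal distribution
feature3 = np.random.randn(n_observations)

def serialize_example(feature0, feature1, feature2, feature3):
    """
    Creates a tf.Example message ready to be written to a file.
    """
    # Create a dictionary mapping the feature name to the tf.Example-compatible
    # data type.
    feature = {
        'feature0': _int64_feature(feature0),
        'feature1': _int64_feature(feature1),
        'feature2': _bytes_feature(feature2),
        'feature3': _float_feature(feature3),
    }

    # Wrap the dictionary in a Features message and build the tf.train.Example.
    example_proto = tf.train.Example(features=tf.train.Features(feature=feature))
    return example_proto.SerializeToString()

# Serialize a single hand-written observation and show the resulting bytes.
serialized_example = serialize_example(False, 4, b'goat', 0.9876)
print(serialized_example)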
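
The serialized bytes are not readable on their own, so it can help to decode them back and check that nothing was lost. Below is a self-contained round-trip sketch, assuming TensorFlow 2.x; it builds the same four features directly instead of going through the helper functions.

```python
import tensorflow as tf

# One feature of each kind; bool and int both go into Int64List.
feature = {
    "feature0": tf.train.Feature(int64_list=tf.train.Int64List(value=[False])),
    "feature1": tf.train.Feature(int64_list=tf.train.Int64List(value=[4])),
    "feature2": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"goat"])),
    "feature3": tf.train.Feature(float_list=tf.train.FloatList(value=[0.9876])),
}
example = tf.train.Example(features=tf.train.Features(feature=feature))
serialized = example.SerializeToString()  # a Python bytes object

# Decode the bytes back into a tf.train.Example message.
restored = tf.train.Example.FromString(serialized)
restored_flag = restored.features.feature["feature0"].int64_list.value[0]
restored_number = restored.features.feature["feature1"].int64_list.value[0]
restored_name = restored.features.feature["feature2"].bytes_list.value[0]
restored_score = restored.features.feature["feature3"].float_list.value[0]
```

The boolean comes back as the integer 0, and the float comes back at float32 precision, so it matches 0.9876 only approximately.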

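Part 3. Using TFRecord in Python

1. Writing a TFRecord

If you built a tf.Example with serialize_example, you can follow the code below.
- The for loop below takes the observations one at a time, passes each through the serialization routine serialize_example, and writes the result straight to the file.

# Write the `tf.Example` observations to the file.
filename = 'test.tfrecord'
with tf.io.TFRecordWriter(filename) as writer: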
  for i in range(n_observations):
    example = serialize_example(feature0[i], feature1[i], feature2[i], feature3[i])
    writer.write(example)

2. Reading a TFRecord

filenames = [filename]
raw_dataset = tf.data.TFRecordDataset(filenames)
raw_dataset

Read records out directly with take

for raw_record in raw_dataset.take(1):
  example = tf.train.Example()
  example.ParseFromString(raw_record.numpy())
  print(example)

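Part 4. Using TFRecord with the `tf.data` API

tf.data is a high-level TensorFlow API. Its convenience and power lie in how it optimizes the input data during training!!

1. Writing a TFRecord

The simplest way is the from_tensor_slices method, which tells the API that the data will be read in from the tensors one slice at a time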

features_dataset = tf.data.Dataset.from_tensor_slices((feature0, feature1, feature2, feature3))
features_dataset

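Printing it shows no actual data, only a description of the dataset. At this point you can pull elements out one at a time with take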

for f0,f1,f2,f3 in features_dataset.take(1):
    print(f0)
    print(f1)
    print(f2)
    print(f3)

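What remains is to serialize it and save it, which can be done with tf.data.Dataset.map
- Note, however, that the function passed to map runs in graph mode, so it has to operate on TensorFlow tensors

serialize_example is ordinary Python code that needs eager values, so wrap it in tf.py_function to have it computed in Python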

def tf_serialize_example(f0,f1,f2,f3):
    tf_string = tf.py_function(
        serialize_example,
        (f0,f1,f2,f3),  # pass these args to the above function.
        tf.string)      # the return type is `tf.string`.
    return tf.reshape(tf_string, ()) # The result is a scalar

The output of tf_serialize_example(f0,f1,f2,f3) is now a tf.Tensor
- Once you have confirmed the output is a tf.Tensor, use tf.data.Dataset.map

# Map the tf.py_function wrapper; serialize_example itself cannot consume
# the graph tensors that map passes in.
serialized_features_dataset = features_dataset.map(tf_serialize_example)
serialized_features_dataset
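
To see the tf.py_function pattern in isolation, here is a stripped-down sketch, assuming TensorFlow 2.x. `python_only` is a hypothetical stand-in for any function that needs real Python values (it calls `.numpy()`), just like a serialization routine does.

```python
import tensorflow as tf


def python_only(x):
    # Runs eagerly inside tf.py_function, so .numpy() is available here.
    return str(int(x.numpy()) * 2).encode()


def wrapped(x):
    # Tell TensorFlow the Python function returns a string tensor.
    out = tf.py_function(python_only, (x,), tf.string)
    return tf.reshape(out, ())  # The result is a scalar


dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3]).map(wrapped)
result = [item.numpy() for item in dataset]
print(result)  # [b'2', b'4', b'6']
```

The trade-off is that the wrapped function runs as ordinary Python, so it does not benefit from graph optimizations.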

Once serialization is done, it can be saved as a TFRecord

filename = 'test.tfrecord'
writer = tf.data.experimental.TFRecordWriter(filename)
writer.write(serialized_features_dataset)

2. Reading a TFRecord

filenames = [filename]
raw_dataset = tf.data.TFRecordDataset(filenames)
raw_dataset
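
The records in a TFRecordDataset are still serialized byte strings. A parsing function mapped over the dataset turns them back into tensors. The sketch below assumes TensorFlow 2.x and the four-feature layout used earlier; `make_example` and the temporary file are only there to keep the example self-contained.

```python
import os
import tempfile

import tensorflow as tf


def make_example(flag, number, name, score):
    """Serialize one observation (illustrative helper)."""
    feature = {
        "feature0": tf.train.Feature(int64_list=tf.train.Int64List(value=[flag])),
        "feature1": tf.train.Feature(int64_list=tf.train.Int64List(value=[number])),
        "feature2": tf.train.Feature(bytes_list=tf.train.BytesList(value=[name])),
        "feature3": tf.train.Feature(float_list=tf.train.FloatList(value=[score])),
    }
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    return example.SerializeToString()


path = os.path.join(tempfile.mkdtemp(), "toy.tfrecord")
with tf.io.TFRecordWriter(path) as writer:
    writer.write(make_example(False, 4, b"goat", 0.9876))
    writer.write(make_example(True, 1, b"dog", -0.5))

# The description must match the names and types used when writing.
feature_description = {
    "feature0": tf.io.FixedLenFeature([], tf.int64, default_value=0),
    "feature1": tf.io.FixedLenFeature([], tf.int64, default_value=0),
    "feature2": tf.io.FixedLenFeature([], tf.string, default_value=""),
    "feature3": tf.io.FixedLenFeature([], tf.float32, default_value=0.0),
}


def _parse_function(example_proto):
    # Parse one serialized tf.train.Example into a dict of tensors.
    return tf.io.parse_single_example(example_proto, feature_description)


parsed_dataset = tf.data.TFRecordDataset([path]).map(_parse_function)
rows = [{key: value.numpy() for key, value in row.items()} for row in parsed_dataset]
```

Unlike the serialization routine, this parsing function uses only TensorFlow ops, so it can be passed to map directly without tf.py_function.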

Once the TFRecord has been read, you can use the tf.data API features (consuming_tfrecord_data)
- You can also use take, but that must be done in eager mode (reading_a_tfrecord_file)

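# TensorFlow 1.x graph-mode code: tf.placeholder and make_initializable_iterator
# are not part of the TF 2.x top-level API.
# `sess` below is assumed to be an already open tf.Session.
filenames = tf.placeholder(tf.string, shape=[None])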
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(...)  # Parse the record into tensors.
dataset = dataset.repeat()  # Repeat the input indefinitely.
dataset = dataset.batch(32)
iterator = dataset.make_initializable_iterator()

# You can feed the initializer with the appropriate filenames for the current
# phase of execution, e.g. training vs. validation.

# Initialize `iterator` with training data.
training_filenames = ["/var/data/file1.tfrecord", "/var/data/file2.tfrecord"]
sess.run(iterator.initializer, feed_dict={filenames: training_filenames})

# Initialize `iterator` with validation data.
validation_filenames = ["/var/data/validation1.tfrecord", ...]
sess.run(iterator.initializer, feed_dict={filenames: validation_filenames})
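
Under TensorFlow 2.x eager execution, a similar training/validation switch can be sketched by building the pipeline inside a function that takes the file names. This is an assumption about how one might port the idea, not the original author's code; the toy files, the single feature `x`, and the helper names are all invented for the example.

```python
import os
import tempfile

import tensorflow as tf


def write_toy_file(path, values):
    """Write one int64 feature named "x" per record (illustrative data)."""
    with tf.io.TFRecordWriter(path) as writer:
        for v in values:
            feature = {"x": tf.train.Feature(int64_list=tf.train.Int64List(value=[v]))}
            example = tf.train.Example(features=tf.train.Features(feature=feature))
            writer.write(example.SerializeToString())


def parse(record):
    spec = {"x": tf.io.FixedLenFeature([], tf.int64)}
    return tf.io.parse_single_example(record, spec)["x"]


def make_dataset(filenames, batch_size):
    dataset = tf.data.TFRecordDataset(filenames)
    dataset = dataset.map(parse)    # Parse the record into tensors.
    dataset = dataset.repeat()      # Repeat the input indefinitely.
    dataset = dataset.batch(batch_size)
    return dataset


tmp_dir = tempfile.mkdtemp()
training_file = os.path.join(tmp_dir, "train.tfrecord")
validation_file = os.path.join(tmp_dir, "validation.tfrecord")
write_toy_file(training_file, range(5))
write_toy_file(validation_file, range(100, 103))

# Because of repeat(), take one batch instead of iterating to the end.
train_batch = next(iter(make_dataset([training_file], batch_size=4)))
val_batch = next(iter(make_dataset([validation_file], batch_size=4)))
```

The validation file holds only three records, so with repeat() the fourth slot of its first batch wraps around to the first record again.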

If the input is multi-channel data (such as images), preprocess it with Dataset.map() (preprocessing_data_with_datasetmap)

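Part 5. Summary

I worked long and hard writing all this, but you're surely only reading this part, sob sob

There are two ways to write and read a TFRecord
- in python
- by using tf.data
There are also two processing modes
- graph mode
- eager mode
Prerequisites
- tf.Example
- a serialization routine (the serialize_example function in the example)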

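TFRecord	python	tf.data eager	tf.data graph	Notes
write	Write with a for loop through the `serialization routine`	Convert to a serialized tensor via `tf.py_function`, `map` to the `serialization routine`, generate the serialized dataset via `yield`, write with `TFRecordWriter`	`map` to the `serialization routine`, generate the serialized dataset via `yield`, write with `TFRecordWriter`
read	Decode with `ParseFromString`, read using for and `take`	Load with `TFRecordDataset`, read directly with `take`	Load with `TFRecordDataset`, `map` to `_parse_function`, read with `take`	`tf.data` read only: consuming_tfrecord_data; `tf.data` read and display: reading_a_tfrecord_file; multi-channel data: preprocessing_data_with_datasetmap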